Abstract

Fault-tolerant computer systems change their level of
performance (e.g., mode of operation or service rate) in response to
different events such as failure, degradation or repair. We present
a unified model for the analysis of job (task) completion time and
the accumulated service (reward) until a given time (also known as
performability). In prior work, the evaluation of the distribution
of performability was restricted to nonrepairable systems
(represented by acyclic Markov chains). In this paper, we describe
an algorithm for the numerical evaluation of the distributions of
performability or job completion time in repairable fault-tolerant
systems (represented by cyclic Markov chains). We demonstrate
the feasibility of our techniques by means of numerical examples.

Keywords. Degradable/Repairable Systems, Markov Reward
Processes, Numerical Methods, Performability Modelling, Task
Performance, Task-Oriented Reliability.

1.  Introduction 

The increased reliability requirements of present day systems
have caused fault-tolerant and degradable systems to become more
important. For these systems, it is important to introduce
measures that reflect both performance and reliability of the
system. Several authors have developed models for the evaluation
of reliability, performability and program performance (e.g., task
completion time measures). This paper is an attempt to unify
these different models with a single model which is useful for
assessing the behaviour of degradable/repairable computer
systems.

As pointed out by Meyer [15], distributed and multiple
processor systems are generally characterised by three main
features: concurrency, fault-tolerance and degradable performance;
furthermore, real-time systems must also possess the timeliness
property. Traditional system-oriented reliability/availability
models have covered the fault-tolerance aspect [19]. Job-oriented
reliability models have catered to the fault-tolerance and
timeliness aspects simultaneously [2,11]. System-oriented
performability models have included both fault-tolerance and
degradable performance in system evaluation [5,7,10,14]. The
unifying aspect of this paper is that fault-tolerance, degradable
performance and timeliness are addressed simultaneously.
Concurrency and timeliness issues are addressed elsewhere [18].

In the model we develop, changes in the structural state of
the computer system caused by different events are described by a
stochastic process (referred to as the structure-state process).
Associated with each structure state is a reward rate (e.g., service
rate or throughput). It is particularly useful for our unifying
analysis to consider the execution of a particular job on the
system. In this case the reward rate represents the service rate
(e.g., the number of instructions executed per unit time). The
completion time of the job is affected by the preemptions and the
possible variations in the service rate due to changes in the
structure state of the system. If the job service is always resumed
after preemptions, then the completion time of a given job and the
cumulative service (measure or reward) until a given time are dual
measures, so that the distribution of one allows us to compute the
distribution of the other. This is a key observation because the
analysis of the completion time yields the distribution of the
cumulative measure (accumulated reward or performability), which
can be appropriately specialised to obtain different
system-oriented measures, as will be shown in Section 2.

Howard [9] studied the expected accumulated reward until a
given time in Markov and semi-Markov reward processes. Other
authors [5,7,14] have studied the distribution of the accumulated
reward in acyclic reward processes, under the assumption that
preemptions do not result in a loss of work. Iyer et al. [10]
considered the evaluation of the moments of the accumulated
reward in cyclic Markov reward processes. In [12,13] we have
considered the analysis of the job completion time in the presence
of different types of preemptions, with the possibility of loss of all
work (in which case the job has to be repeated). In the present
paper, the distribution of the accumulated reward in cyclic and
acyclic Markov reward processes is derived as a special case of the
more general analysis of job completion. We give an algorithm for
the numerical evaluation of the distributions of the job completion
time and performability measures in repairable fault-tolerant
systems. The work presented in this paper is a significant
contribution, since the earlier work was restricted to nonrepairable
systems.

In Section 2, we introduce the mathematical model, some
definitions and notations. The transform solution of the completion
time and the cumulative measure is derived in Section 3. An
algorithm for the numerical evaluation of these measures is
presented in Section 4. In Section 5, we give nontrivial numerical
examples to demonstrate the use of the solution technique and the
feasibility of the algorithm developed.

2.  The  Basic  Model  and  Definitions 

Consider a particular job to be processed on a given
computer system. The work requirement of the job is a random
variable $B$, and is measured in work units (e.g., the number of
instructions to be executed). It has the distribution function
$G(x) = P(B \le x)$ and the LST (Laplace-Stieltjes Transform)$^1$
$\tilde{G}(s) = E(e^{-sB})$. It is assumed that $G(0+) = 0$.

The stochastic process $\{Z(t), t \ge 0\}$, which describes the
behaviour of the system in time (the structure-state process), is a
time-homogeneous continuous-time Markov chain (CTMC). $Z(t)$
is the state of the system at time $t$. This stochastic process is
assumed to be independent of the work requirement of the job. At
any given time, the system can be in one of $n$ states. In state $i$
the system serves the job at a rate $r_i \ge 0$, $1 \le i \le n$. The set of
states $\{1, 2, \ldots, n\}$ is canonically partitioned into $k+1$ sets, namely
$S_T, S_{C_1}, S_{C_2}, \ldots, S_{C_k}$, such that $S_T$ is a set of transient states and
$S_{C_i}$, $1 \le i \le k$, is a closed set of recurrent states. (If the system
enters a closed set of states, it stays there forever.) A recurrent set
is called a “failure” set if the reward rate is equal to zero in all its
states; otherwise it is called a “nonfailure” set. If the system
enters a failure set, it stays there and offers no more service
(system failure). On the other hand, if the system enters a
nonfailure set, it stays there and the job will eventually complete.
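
The partition described above can be illustrated with a short sketch. It is not part of the original algorithm: the function names and the four-state example are hypothetical, states are indexed from 0, and the classification uses plain reachability (a state is recurrent when every state it can reach can reach it back), which is adequate only for small chains.

```python
from typing import List, Set, Tuple


def reachable(Q: List[List[float]], start: int) -> Set[int]:
    """States reachable from `start` (itself included) along positive rates."""
    seen = {start}
    stack = [start]
    while stack:
        i = stack.pop()
        for j, rate in enumerate(Q[i]):
            if j != i and rate > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def partition_states(
    Q: List[List[float]], r: List[float]
) -> Tuple[List[int], List[Tuple[List[int], str]]]:
    """Return (transient states, [(closed recurrent set, kind), ...])."""
    n = len(Q)
    reach = [reachable(Q, i) for i in range(n)]
    # In a finite chain, state i is recurrent iff every reachable state leads back to i.
    recurrent = [all(i in reach[j] for j in reach[i]) for i in range(n)]
    transient = [i for i in range(n) if not recurrent[i]]
    closed_sets = []
    assigned: Set[int] = set()
    for i in range(n):
        if recurrent[i] and i not in assigned:
            members = sorted(reach[i])  # the closed class containing i
            assigned.update(members)
            kind = "failure" if all(r[j] == 0 for j in members) else "nonfailure"
            closed_sets.append((members, kind))
    return transient, closed_sets


# Hypothetical example: two transient "up" states, one absorbing failure
# state (reward 0) and one absorbing nonfailure state (reward 1).
Q_example = [
    [-1.1, 1.0, 0.0, 0.1],
    [2.0, -2.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
r_example = [2.0, 1.0, 0.0, 1.0]
```

For larger chains a strongly connected components algorithm would be the usual choice; the quadratic reachability test is used here only for brevity.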

Let $q_{ij}$, $1 \le i, j \le n$, $i \ne j$, be the infinitesimal
transition rate from state $i$ to state $j$.
$Q = [q_{ij}]$, $1 \le i, j \le n$, is the $n$ by $n$ generator matrix, where

$$ q_{ii} = -q_i = -\sum_{j=1,\, j \ne i}^{n} q_{ij} . $$

Note that the row sums of $Q$ are equal to zero (i.e., $Q\,\underline{e} = \underline{0}$, where $\underline{e}$
is the $n$ by 1 vector with all elements equal to one and $\underline{0}$ is the $n$
by 1 zero vector).
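
As a small illustration of the definition of $Q$, the following sketch assembles a generator matrix from a table of off-diagonal rates and fills in the diagonal so that each row sums to zero. The helper name `build_generator` and the dictionary layout are assumptions made for this example, and states are indexed from 0.

```python
from typing import Dict, List, Tuple


def build_generator(rates: Dict[Tuple[int, int], float], n: int) -> List[List[float]]:
    """Build the n-by-n generator Q from off-diagonal rates {(i, j): q_ij}."""
    Q = [[0.0] * n for _ in range(n)]
    for (i, j), q in rates.items():
        if i == j:
            raise ValueError("only off-diagonal rates may be supplied")
        if q < 0:
            raise ValueError("transition rates must be nonnegative")
        Q[i][j] = q
    for i in range(n):
        # q_ii = -q_i = -(sum of the off-diagonal entries in row i)
        Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)
    return Q


# Hypothetical three-state chain used only to exercise the helper.
Q_demo = build_generator({(0, 1): 0.3, (1, 0): 4.0, (1, 2): 0.1}, 3)
```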

Now  let  us  introduce  some  important  performance  measures 
that  will  be  used  throughout  the  paper. 

Cumulative measure, $Y(t)$, is the total reward gained in all
structure states until time $t$ (in this paper we also refer to it as
the accumulated reward or performability), i.e.,

$$ Y(t) = \int_0^t r_{Z(u)}\, du . $$

$Y(\infty) = \lim_{t \to \infty} Y(t)$ is the cumulative measure until system
failure [1,20]. $Y(t)$ can be specialised to the following job- and
system-oriented measures.


System reliability, $R(t)$: let $X$ be the time until system failure,
i.e., the time until the structure-state process enters a failure set of
states. If we set $r_i = 1$ for all states $i$ that are not contained in any
failure set of states, then

$$ R(t) = P(X > t) = \lim_{\tau \to \infty} P(Y(\tau) > t) . $$

Total “up” or “down” time until time $t$, $U(t)$ or $D(t)$: the
system is said to be “up” if it is in a state $i$ with $r_i > 0$; otherwise
it is said to be “down”. $U(t)$ (or $D(t)$) is defined to be the total
time the system spends in “up” (or “down”) states until time
$\min\{t, X\}$, where $X$ is the system's life time. Clearly, if we set
all $r_i > 0$ to 1, then

$$ P(U(t) \le x) = P(Y(t) \le x) $$

and, since

$$ U(t) + D(t) = \min\{t, X\} , $$

it follows that

$$ P(D(t) \le x) = P(Y(t) \ge \min\{t, X\} - x) . $$

Interval availability, $A_I(t)$, is defined to be the fraction of time the
system spends in “up” states in the interval $(0,t)$, i.e.,
$A_I(t) = U(t)/t$. The distribution of the interval availability
has been a subject of recent investigations [8].

In Section 3, we derive double-transform equations for the
distributions of the job completion time and the cumulative
measure, and in Section 4 we describe an algorithm for the
numerical evaluation of these distributions. In the remainder of
this section we introduce some notations that will be used later.
Define the distribution functions

$$ F_i(t,x) = P(T(x) \le t \mid Z(0) = i), \quad x \ge 0, \ 1 \le i \le n, $$

$$ F(t,x) = P(T(x) \le t), \quad x \ge 0, $$

$$ F_i(t) = P(T \le t \mid Z(0) = i), \quad 1 \le i \le n. $$


Job completion time, $T(x)$, is the time needed to complete a job
whose work requirement is $x$ units of work. $T$ denotes the
completion time of a job that requires a random amount of work,
$B$. Since $Y(t)$ represents the useful work done on the job until
time $t$, it is a nondecreasing function and has piecewise continuous
paths. It follows that $T(x) = \min\{t \ge 0 : Y(t) = x\}$ and
$T = \min\{t \ge 0 : Y(t) = B\}$. The analysis of the job completion
time has been considered for special cases in [2,6,16].
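
To make the two definitions concrete, the sketch below evaluates $Y(t)$ and $T(x)$ along a single sample path. The path representation, a list of (service rate, sojourn time) pairs whose last rate is assumed to persist forever, and the function names are choices made for this illustration only.

```python
import math
from typing import List, Tuple

Path = List[Tuple[float, float]]  # (service rate, sojourn time) per visited state


def accumulated_reward(path: Path, t: float) -> float:
    """Y(t): work done by time t; the last rate in `path` persists forever."""
    y, clock = 0.0, 0.0
    for rate, sojourn in path:
        if t <= clock + sojourn:
            return y + rate * (t - clock)
        y += rate * sojourn
        clock += sojourn
    return y + path[-1][0] * (t - clock)


def completion_time(path: Path, x: float) -> float:
    """T(x) = min{t >= 0 : Y(t) = x}, or infinity if x is never reached."""
    if x == 0:
        return 0.0
    y, clock = 0.0, 0.0
    for rate, sojourn in path:
        if rate > 0 and y + rate * sojourn >= x:
            return clock + (x - y) / rate
        y += rate * sojourn
        clock += sojourn
    last_rate = path[-1][0]
    return clock + (x - y) / last_rate if last_rate > 0 else math.inf


# Hypothetical paths: the first ends in a nonfailure state (rate 1 forever),
# the second ends in a failure state (rate 0 forever).
path_ok = [(2.0, 1.0), (0.0, 0.5), (1.0, 2.0)]
path_failed = [(2.0, 1.0), (0.0, 0.5)]
```

On `path_failed` the job with $x = 3$ never completes, so `completion_time` returns infinity; this is the sample-path counterpart of an omission failure.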

Probability of omission failure, $\eta(x)$, is defined to be the
probability that the system fails before the completion of a job
that requires $x$ units of work. Thus

$$ \eta(x) = P(Y(t) < x \text{ for all } t \ge 0) = P(T(x) = \infty) . $$

If $\eta$ denotes the probability that the system fails before the
completion of a job with random work requirement, then

$$ \eta = P(Y(t) < B \text{ for all } t \ge 0) = P(T = \infty) . $$

A related measure is the dynamic failure probability in real-time
systems. For a hard deadline $d$, it is given by $P(T > d)$,
and is readily obtained from the distribution function of $T$.


1 (~) denotes the LST, i.e., the Laplace transform of a probability density
function, and E(.) is the expectation operator.


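Similarly, let

$$ F(t) = P(T \le t) $$

and define the LSTs

$$ \tilde{F}_i(s,x) = E(e^{-sT(x)} \mid Z(0) = i), \quad x \ge 0, \ 1 \le i \le n. \qquad (2.1) $$

From the independence of $\{Z(t), t \ge 0\}$ and $B$ it follows that

$$ \tilde{F}(s,x) = E(e^{-sT(x)}) = \sum_{i=1}^{n} \tilde{F}_i(s,x)\, P(Z(0) = i), \quad x \ge 0, \qquad (2.2) $$

$$ \tilde{F}_i(s) = E(e^{-sT} \mid Z(0) = i) = \int_0^\infty \tilde{F}_i(s,x)\, dG(x), \quad 1 \le i \le n, \qquad (2.3) $$

$$ \tilde{F}(s) = E(e^{-sT}) = \sum_{i=1}^{n} \tilde{F}_i(s)\, P(Z(0) = i). \qquad (2.4) $$

The omission failure probability, $\eta$, follows from

$$ \eta = P(T = \infty) = 1 - \lim_{s \to 0} \tilde{F}(s) . $$
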
3.  The Transform Solution

We give the transform solution for the distribution of the job
completion time in Theorem 3.1. In Theorem 3.2 we present a
useful dual relationship between the cumulative measure and the
completion time.


Let us first define the following transforms$^1$

$$ \tilde{Y}(w,t) = E(e^{-wY(t)}), \qquad \tilde{Y}^*(w,s) = \int_0^\infty e^{-st}\, \tilde{Y}(w,t)\, dt , $$

$$ \tilde{F}(s,x) = E(e^{-sT(x)}), \qquad \tilde{F}^*(s,w) = \int_0^\infty e^{-wx}\, \tilde{F}(s,x)\, dx . $$

A subscript $i$ on any of these transforms denotes conditioning on
$Z(0) = i$, as in Equation (2.1). The following notations will be used

$$ \underline{\tilde{F}}^*(s,w) = [\tilde{F}_1^*(s,w), \tilde{F}_2^*(s,w), \ldots, \tilde{F}_n^*(s,w)]^T , $$

$$ \underline{\tilde{Y}}^*(w,s) = [\tilde{Y}_1^*(w,s), \tilde{Y}_2^*(w,s), \ldots, \tilde{Y}_n^*(w,s)]^T , $$

$$ \underline{r} = [r_1, r_2, \ldots, r_n]^T \quad \text{and} \quad R = \mathrm{diag}[r_1, r_2, \ldots, r_n] , $$

where the superscript $T$ denotes transpose.


Theorem 3.1:

The double transforms $\tilde{F}_i^*(s,w)$, $1 \le i \le n$, satisfy the
following equations

$$ \tilde{F}_i^*(s,w) = \frac{r_i}{s + q_i + w r_i} + \sum_{j \ne i} \frac{q_{ij}}{s + q_i + w r_i}\, \tilde{F}_j^*(s,w) . \qquad (3.1) $$

Since the matrix $[sI + wR - Q]$ is invertible, Equations (3.1) can
be written in a matrix form as follows

$$ \underline{\tilde{F}}^*(s,w) = [sI + wR - Q]^{-1}\, \underline{r} \qquad (3.1a) $$

where $I$ is the identity matrix. The proof is given in [12].

The following theorem presents a useful dual relationship
between the cumulative measure at a given time and the
completion time of a given job. As a result, knowing the
distribution of $T(x)$ allows us to determine the distribution of
$Y(t)$. Consequently, system-oriented measures such as system
reliability, interval availability and others can be determined by
appropriately specialising the reward rates in different structure
states (as discussed in Section 2).


Theorem 3.2:

The distribution function of the cumulative measure, $Y(t)$,
is related to the distribution function of the completion time
$T(x)$, as follows

$$ P(Y(t) < x) = 1 - P(T(x) \le t) \qquad (3.2) $$

and the corresponding double transforms are related as follows

$$ \tilde{Y}^*(w,s) = \frac{1}{s} \left[ 1 - w\, \tilde{F}^*(s,w) \right] . \qquad (3.3) $$

Proof:

It is clear that

$$ P(Y(t) < x) = P(T(x) > t) , $$

since these are the probabilities of two identical events, and
Equation (3.2) follows. Multiplying both sides of Equation (3.2) by
$e^{-wx}$, integrating with respect to $x$, and using
$\int_0^\infty e^{-wx} P(Y(t) < x)\, dx = \tilde{Y}(w,t)/w$, we get

$$ \tilde{Y}(w,t) = 1 - w \int_0^\infty e^{-wx}\, P(T(x) \le t)\, dx . $$

Multiplying by $e^{-st}$ and integrating with respect to $t$, Equation
(3.3) follows. Q.E.D.


1 (*) denotes the Laplace transform of a function.


We can rewrite Equation (3.3) in a matrix form as follows

$$ \underline{\tilde{Y}}^*(w,s) = [sI + wR - Q]^{-1}\, \underline{e} \qquad (3.3a) $$

where we made use of Equation (3.1a) and the relation $Q\,\underline{e} = \underline{0}$.
Equation (3.3a) was derived in [17] using a different approach.
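
The matrix forms (3.1a) and (3.3a) and the scalar relation (3.3) can be checked against each other numerically. The sketch below is only an illustration: the three-state generator and reward rates are hypothetical, the helper names are not from the paper, and a plain Gaussian elimination stands in for whatever linear solver one would normally use.

```python
from typing import List, Sequence, Tuple


def solve(A: Sequence[Sequence[complex]], b: Sequence[complex]) -> List[complex]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[complex(v) for v in row] + [complex(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(M[row][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for k in range(col, n + 1):
                M[row][k] -= factor * M[col][k]
    x = [0j] * n
    for row in reversed(range(n)):
        acc = M[row][n] - sum(M[row][k] * x[k] for k in range(row + 1, n))
        x[row] = acc / M[row][row]
    return x


def double_transforms(
    Q: Sequence[Sequence[float]], r: Sequence[float], s: complex, w: complex
) -> Tuple[List[complex], List[complex]]:
    """Return (F, Y): the vectors given by (3.1a) and (3.3a) at the point (s, w)."""
    n = len(Q)
    A = [[(s + w * r[i] if i == j else 0.0) - Q[i][j] for j in range(n)]
         for i in range(n)]
    F = solve(A, r)            # [sI + wR - Q]^{-1} r
    Y = solve(A, [1.0] * n)    # [sI + wR - Q]^{-1} e
    return F, Y


# Hypothetical chain: two "up" states and one absorbing failure state.
Q_check = [[-0.3, 0.3, 0.0], [4.0, -4.1, 0.1], [0.0, 0.0, 0.0]]
r_check = [1.6, 1.0, 0.0]
```

For the absorbing failure state the accumulated reward is identically zero, so its entry of the second vector reduces to $1/s$, which gives a quick sanity check on the solver.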


4.  An  Algorithm  for  Numerical  Computations 

In this section we consider the evaluation of the distribution
function of the cumulative measure,

$$ Y_i(x,t) = P(Y(t) \le x \mid Z(0) = i), \quad 1 \le i \le n, $$

from Equation (3.3a).

Let $A = [sI + wR - Q]$; then by Cramer's rule we have

$$ \tilde{Y}_i^*(w,s) = \frac{N_i(w,s)}{D(w,s)}, \qquad 1 \le i \le n, \qquad (4.1) $$

where $D(w,s)$ is the determinant of $A$, and $N_i(w,s)$ is the
determinant of $A$ with the $i$th column replaced by $\underline{e}$. Clearly,
$D(w,s)$ and $N_i(w,s)$ are polynomials in $w$ and $s$. Therefore, for
a fixed value of $w$, $\tilde{Y}_i^*(w,s)$ is a rational function in $s$, say
$N_i(s)/D(s)$. Once the roots $s_1(w), s_2(w), \ldots, s_n(w)$ of the
polynomial $D(s)$ are determined, we obtain the partial fraction
expansion of $\tilde{Y}_i^*(w,s)$ and then invert analytically with respect
to $s$. These roots are precisely the eigenvalues of $[Q - wR]$, and
are determined numerically using orthogonal transformations and
the QR algorithm [21]. Let $D(s)$ be given by $\prod_{j=1}^{d} (s - s_j(w))^{m_j}$,
where each of its distinct roots, namely $s_j(w)$, $1 \le j \le d$, has
multiplicity $m_j$. Then we can write the partial fraction expansion
as follows

$$ \tilde{Y}_i^*(w,s) = \frac{N_i(s)}{D(s)} = \sum_{j=1}^{d} \sum_{k=1}^{m_j} a_{ijk}(w)\, (s - s_j(w))^{-k} . \qquad (4.2) $$

We then choose $n$ values of $s$ that are not too close to any
$s_j(w)$, $1 \le j \le d$, and substitute in Equation (4.2). From the
resulting linear system of equations, the $n$ unknowns $a_{ijk}(w)$ are
uniquely determined. Now we invert $\tilde{Y}_i^*(w,s)$ analytically with
respect to $s$, to get

$$ \tilde{Y}_i(w,t) = \sum_{j=1}^{d} \sum_{k=1}^{m_j} a_{ijk}(w)\, \frac{t^{k-1}}{(k-1)!}\, e^{s_j(w)\, t} . \qquad (4.3) $$

Once we have $\tilde{Y}_i(w,t)$, we invert numerically with respect to $w$
as follows.


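For notational simplicity we let $v(x)$ denote $Y_i(x,t)$ and let
$V^*(w) = \tilde{Y}_i(w,t)/w$ denote its Laplace transform with respect
to $x$ (the LST $\tilde{Y}_i(w,t)$ divided by $w$). This will avoid the
confusion between the subscript $i$ and the radical $i = \sqrt{-1}$ that we
need below. We secure the inverse of $V^*(w)$ with the inversion formula

$$ v(x) = \frac{1}{2\pi i} \int_{c - i\infty}^{c + i\infty} e^{wx}\, V^*(w)\, dw . \qquad (4.4) $$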


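Now let $f(x)$ be a periodic function whose period, $2T$, is the
interval of interest (here $T$ is the half-period, not the completion
time), such that $f(x) = e^{-cx}\, v(x)$, $0 \le x < 2T$.
The parameter $c$ is chosen so that $f(x)$ is a bounded function.
An approach to approximating $v(x)$ over the interval $0 \le x \le 2T$
is to obtain the Fourier series expansion of the function $f(x)$,
which is expressed in terms of the Laplace transform $V^*(w)$ of $v(x)$:

$$ v(x) \approx \frac{e^{cx}}{T} \left[ \tfrac{1}{2} V^*(c) + \sum_{k=1}^{\infty} \Big( \mathrm{Re}\big(V^*(c + ik\pi/T)\big) \cos(k\pi x/T) - \mathrm{Im}\big(V^*(c + ik\pi/T)\big) \sin(k\pi x/T) \Big) \right] . \qquad (4.5) $$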

The above series is approximated by the first $m$ terms. It has
characteristically slow convergence, but it can be accelerated by
continued fraction methods such as Wynn's $\epsilon$ algorithm, or the
quotient-difference algorithm of Rutishauser with a remainder
estimate suggested by de Hoog [3]; we use the latter approach.
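
The truncated series can be sketched directly. The code below is a plain partial sum of (4.5) with no convergence acceleration, so it is not the accelerated method used in the paper and needs far more than 80 terms. The function name, the test transform $1/(w(w+1))$ (whose inverse is $1 - e^{-x}$) and the parameter values are assumptions chosen for this illustration.

```python
import math
from typing import Callable


def fourier_series_inverse(
    V: Callable[[complex], complex], x: float, T: float, c: float, m: int
) -> float:
    """Partial sum of (4.5): approximate v(x) from its Laplace transform V."""
    total = 0.5 * V(complex(c, 0.0)).real
    for k in range(1, m + 1):
        theta = k * math.pi / T
        Vk = V(complex(c, theta))
        total += Vk.real * math.cos(theta * x) - Vk.imag * math.sin(theta * x)
    return math.exp(c * x) / T * total


def V_test(w: complex) -> complex:
    """Laplace transform of v(x) = 1 - exp(-x), used only as a known test case."""
    return 1.0 / (w * (w + 1.0))


# Without acceleration, a few thousand terms are needed for three-digit accuracy.
approx = fourier_series_inverse(V_test, x=1.0, T=10.0, c=0.5, m=2000)
exact = 1.0 - math.exp(-1.0)
```

The slow decay of the terms (roughly $1/k^2$ for this transform) is what makes an acceleration scheme attractive in practice.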

We now give the algorithmic structure for the computation
of $Y_i(x,t)$ for a system with $n$ states.

A: for ( m values of w )
       determine s_j(w)     ( QR algorithm )                O(n^3)
   B:  for ( n values of s != s_j(w), 1 <= j <= d )
           determine N_i(w,s)   ( Evaluate determinant )    O(n^3)
       determine a_ijk(w)   ( Solve linear system )         O(n^3)

C: for ( p different values of t )
   D:  for ( m values of w )
           { Evaluate (4.3) for a given time t }            O(n)
   E:  for ( q different values of x )
           {
             Evaluate finite approximation
             to (4.5), for a given (x,t) pair               O(m)
           }
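
Steps A, B and D can be traced by hand on a two-state chain, where the roots of $D(s)$ come from the quadratic formula instead of the QR algorithm. The sketch below assumes the two roots are distinct (so every multiplicity is one) and that $\mathrm{Re}(w) \ge 0$; the function name and the sample points are hypothetical choices for this illustration.

```python
import cmath


def lst_two_state(q12: float, q21: float, r1: float, r2: float,
                  w: complex, t: float, i: int = 0) -> complex:
    """E(exp(-w Y(t)) | Z(0) = i) for a two-state chain via (4.1)-(4.3).

    Assumes distinct roots; states are indexed 0 and 1.
    """
    a11 = w * r1 + q12          # diagonal entries of wR - Q
    a22 = w * r2 + q21
    # Step A: roots of D(s) = det[sI + wR - Q] = s^2 + b s + c0.
    b = a11 + a22
    c0 = a11 * a22 - q12 * q21
    disc = cmath.sqrt(b * b - 4 * c0)
    roots = [(-b + disc) / 2, (-b - disc) / 2]   # eigenvalues of Q - wR

    def y_star(s: complex) -> complex:
        # Cramer's rule (4.1): column i of A replaced by the all-ones vector.
        D = (s + a11) * (s + a22) - q12 * q21
        N = (s + a22 + q12) if i == 0 else (s + a11 + q21)
        return N / D

    # Step B: sample two values of s away from the roots and solve the
    # 2-by-2 linear system from (4.2) for the coefficients a1, a2.
    mag = max(abs(z) for z in roots)
    sa, sb = 1.0 + mag, 2.0 + 2.0 * mag
    m11, m12 = 1 / (sa - roots[0]), 1 / (sa - roots[1])
    m21, m22 = 1 / (sb - roots[0]), 1 / (sb - roots[1])
    ya, yb = y_star(sa), y_star(sb)
    det = m11 * m22 - m12 * m21
    a1 = (ya * m22 - m12 * yb) / det
    a2 = (m11 * yb - ya * m21) / det
    # Step D: evaluate (4.3) with simple roots.
    return a1 * cmath.exp(roots[0] * t) + a2 * cmath.exp(roots[1] * t)
```

Two special cases give closed-form checks: at $w = 0$ the transform equals one, and with equal reward rates $r_1 = r_2 = r$ the accumulated reward is $rt$, so the transform equals $e^{-wrt}$.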


Since the number of repetitions of the inner loop B is $O(n)$,
clearly the algorithm requires $O(n^4 m)$ time to compute $Y_i(x,t)$
for a given $(x,t)$ pair. Because most of the computational effort
occurs before step C, additional values of $Y_i(x,t)$ can be
computed cheaply. $Y_i(x,t)$ may be evaluated for $q$ additional
values of $x$ at an increase of only $O(qm)$ computation time, since
just loop E must be recomputed for the new values of $x$. To
obtain $Y_i(x,t)$ for a different value of $t$ and $q$ different values of
$x$ requires only that loop D be performed $m$ times and loop E be
performed $q$ times. The computational burden for the new $t$
value is thus $O(mn + mq)$. For example, if we wish to obtain
the values of $Y_i(x,t)$ for $p$ values of $t$ and $q$ values of $x$ for each
value of $t$ (as on a rectangular grid), then the $pq$ values could be
determined in $O(n^4 m + mp(n + q))$ time. It should be noted
that the storage requirement is independent of the number of
$(x,t)$ pairs for which $Y_i(x,t)$ is evaluated. The matrix operations
in loop B require $O(n^2)$ storage. The $a_{ijk}(w)$ evaluated for the
$O(mn)$ values of $s_j(w)$ are needed to perform loop C, and $m$
values of $\tilde{Y}_i(w,t)$ evaluated from (4.3) are required in loop E.
Hence the total space requirement is $O(n^2 + nm)$. Accurate
results are readily obtained when 80 terms are used to approximate
(4.5) ($m = 80$). In the next section we give two numerical
examples illustrating the feasibility of the above algorithm.

5.  Examples 

First, we consider a fault-tolerant multiprocessor system with
finite buffer stages. A similar two-processor system (without repair)
was considered by Meyer [14], and was extended by Iyer et al. [10]
to include repair. In [10] they describe a numerical algorithm to
compute the moments (rather than the distribution) of
performability. In our example we use the numerical technique
described above to obtain the distribution of performability. For
$N$ processors and $L$ buffer stages, the system is modelled as an
$M/M/N/N+L$ queueing system. Jobs arrive at rate $\Lambda$ and are
lost when the buffer is full. The job service rate is $\delta$. Processors
fail independently at rate $\lambda$ and are repaired singly with rate $\mu$.
Buffer stages fail independently at rate $\gamma$ and are repaired singly
with rate $\tau$. Processor failure causes a graceful degradation of the
system (the number of processors is decreased by one). The system
is in a failed state when all processors have failed or any buffer
stage has failed. No additional processor failures are assumed to
occur when the system is in a failed state. The model is
represented by a CTMC with the state-transition diagram shown
in Figure 1. At any given time the state of the system is $(i,j)$,
where $0 \le i \le N$ is the number of nonfailed processors, and $j$ is
zero if any of the buffer stages is failed, otherwise it is one. An
appropriate reward rate in a given state is the steady-state
throughput of the system with the given number of nonfailed
processors (the throughput formula is a well known result [19]).
The reward rate is zero in any system failure state.
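
The reward rates for this example can be sketched as follows, assuming the throughput meant here is the standard steady-state result for a finite-capacity multi-server queue, namely the arrival rate times the probability that an arriving job is not lost. The function name and argument names are hypothetical; the source only cites [19] for the formula.

```python
def mmck_throughput(arrival: float, service: float, servers: int, buffers: int) -> float:
    """Steady-state throughput of an M/M/c/K queue with K = servers + buffers."""
    if servers == 0:
        return 0.0  # no working processor: nothing is served
    capacity = servers + buffers
    # Unnormalised stationary probabilities from the birth-death balance equations:
    # p_k / p_{k-1} = arrival / (min(k, servers) * service).
    weights = [1.0]
    for k in range(1, capacity + 1):
        weights.append(weights[-1] * arrival / (min(k, servers) * service))
    p_full = weights[-1] / sum(weights)
    # Jobs that find the system full are lost, so only the rest contribute.
    return arrival * (1.0 - p_full)


# Reward rates for states with i = 0..8 working processors and 2 buffer stages,
# using the arrival and service rates quoted in the example (per hour).
reward_rates = [mmck_throughput(170.0, 20.0, i, 2) for i in range(9)]
```

The resulting rates increase with the number of working processors and stay below both the arrival rate and the total service capacity.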

We evaluate the distribution of performability, $Y(t)$, given
that the system started with all its processors and buffers
operational, for a utilisation period of 10 hours. The number of
processors is eight, each with a failure rate $\lambda$ = 091 per week and
a repair rate $\mu = 0.1606$ per hour. The individual buffer stage
failure rate is $\gamma = 0.22$ per week and its repair rate is $\tau = 0.1660$ per
hour. Jobs arrive at rate $\Lambda = 170$ per hour and the service rate for
a single processor is $\delta = 20$ jobs per hour. In Figure 2 we plot
the distribution of performability for different numbers of buffer
stages. We observe that fewer buffer stages provide a lower
maximum accumulated reward but a more favourable distribution
of $Y(t)$ (i.e., lower values of $P(Y(t) \le x)$, for a given $x$ less
than the maximum possible reward). The run time of our
algorithm for this example (with an underlying Markov chain of 16
states) on a VAX/750 is 100 seconds. The distribution of the “up”
time, $U(t)$, can similarly be evaluated by setting the reward rates
in all nonfailed states to one. The complementary distribution of
the interval availability, $A_I(t)$ ($= U(t)/t$), is plotted in Figure 3
for different numbers of buffer stages. The interval availability,
$A_I(t)$, is lower for more buffer stages; this is due to the increase in
the total buffer failure rate (notice that with more buffer stages the
interval availability is not affected by the increased reward rates in
“up” states). The reliability of the system, $R(t)$, can be
determined by disallowing repair from all system failure states and
evaluating the complementary distribution of the “up” time for an
infinite utilisation period. This is plotted in Figure 4 for different
numbers of buffer stages.

Let us now consider an example to compute the distribution
of the job completion time on a two-processor
(degradable/repairable) system. The system is subject to total
failure (due to imperfect coverage [19,20] or exhaustion of
processors). The processor failure rate is $\gamma$ = 0.…, the processor
repair rate is $\tau = 4.0$ and the coverage factor is $c = 0.99$. The
CTMC representing the system is shown in Figure 5, in which the
states 2, 1 and 0 represent a system with two, one and no
operational processors, respectively. It is further assumed that the
job can be divided into parallel subtasks, so that if both processors
are operational the service rate is increased by a factor $r$,
$1 \le r \le 2$. Let the service rate in state 1 be $r_1 = 1$; then the
service rate in state 2 is $r_2 = r$. We choose $r = 1.6$. Consider
the execution of a job with work requirement equal to $x$ on the
system. In Figure 6 we compare the distribution of the job
completion time when executed on different systems, namely, a
single processor system and a two-processor system with and without
repair. In other words, we study the effect of redundancy and
repair in fault-tolerant systems. The favourable effect of
redundancy is obvious, and the improvement resulting from repair
is clearly significant (this is reflected in a reduced job completion
time and a higher probability of successful job completion; the
latter is the asymptotic value of the distribution function of the
completion time). In Figure 7 we show the effect of the coverage
factor in a two-processor system with and without repair. It is
interesting to remark that the probability of successful job
completion is more sensitive to variations in the coverage factor in
the presence of repair than in the absence of repair. The
favourable effect of repair is again obvious.

6.  Conclusions

In this paper, we have presented a unified modelling
approach to the combined evaluation of performance and
reliability of degradable/repairable fault-tolerant systems. The
structure-state process of the system is modelled as a CTMC, in
which transitions occur in response to events such as failure, repair
or system degradation. A reward rate (or performance measure) is
associated with each structure state.

We gave the transform solution for the distribution of the job
completion time, and related it to the transform solution of
performability. This is a useful dual relationship, since it enables
us to derive from the analysis of the completion time other
measures such as performability, system reliability, up/down time,
and the distribution of interval availability.

We have developed an algorithm for the numerical
evaluation of the distributions of performability and the
completion time of a job from their corresponding transform
solutions. This is a significant step towards the evaluation of
repairable fault-tolerant computer systems.